Let’s consider again the cross-sectional wage data sampled from the 1976 US Current Population Survey¹. It consists of 526 observations of average hourly wages, together with covariates such as education, experience, race, gender, and marital status.
# uncomment the install.packages if you do not have this package
# install.packages('wooldridge')
require('wooldridge')
data('wage1')
head(wage1)
##   wage educ exper tenure nonwhite female married numdep smsa northcen
## 1 3.10   11     2      0        0      1       0      2    1        0
## 2 3.24   12    22      2        0      1       1      3    1        0
## 3 3.00   11     2      0        0      0       0      2    0        0
## 4 6.00    8    44     28        0      0       1      0    1        0
## 5 5.30   12     7      2        0      0       1      1    0        0
## 6 8.75   16     9      8        0      0       1      0    1        0
##   south west construc ndurman trcommpu trade services profserv profocc
## 1     0    1        0       0        0     0        0        0       0
## 2     0    1        0       0        0     0        1        0       0
## 3     0    1        0       0        0     1        0        0       0
## 4     0    1        0       0        0     0        0        0       0
## 5     0    1        0       0        0     0        0        0       0
## 6     0    1        0       0        0     0        0        1       1
##   clerocc servocc    lwage expersq tenursq
## 1       0       0 1.131402       4       0
## 2       0       1 1.175573     484       4
## 3       0       0 1.098612       4       0
## 4       1       0 1.791759    1936     784
## 5       0       0 1.667707      49       4
## 6       0       0 2.169054      81      64
In previous labs, we’ve seen that wages differ according to many of the explanatory variables, indicating that multiple regression is appropriate for analysis of this dataset. In this lab, we will use hypothesis testing to get a sense of which variables are significant.
library(dplyr)
library(ggplot2)
# Recall that wages vary by education, experience, gender, etc.
ggplot(wage1, aes(x= educ, y = log(wage))) + geom_point()+
geom_smooth(method = "lm", se = F)
ggplot(wage1, aes(x= tenure, y = log(wage))) + geom_point()+
geom_smooth(method = "lm", se = F)
ggplot(wage1) + geom_boxplot(aes(x=factor(female), y = log(wage))) +
labs(x = "Gender") + scale_x_discrete(labels = c("Men", "Women"))
ggplot(wage1) + geom_boxplot(aes(x=factor(nonwhite), y = log(wage))) +
labs(x = "Race") + scale_x_discrete(labels = c("White", "Nonwhite"))
Visual exploration of our variables indicates that some have greater effects than others. How can we determine which variables have statistical significance?
Recall that under the normal model \(y = X\beta + \epsilon\) with \(\epsilon \sim \mathcal{N}(0, \sigma^2 I_{n\times n})\), the mean of \(\hat\beta\) is \(\beta\), and the covariance matrix of \(\hat\beta\) is \(\sigma^2(X^TX)^{-1}\).
Thus, the variance of \(\hat\beta_j\) is \(\sigma^2\nu_{jj}\), where \(\nu_{jj}\) is the \(j\)th diagonal element of \((X^TX)^{-1}\).
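As a sanity check, we can verify this covariance formula by simulation. The sketch below uses made-up values on a small simulated design (nothing here comes from the wage data): it draws many response vectors from the model, recomputes \(\hat\beta\) each time, and compares the empirical covariance of the estimates to \(\sigma^2(X^TX)^{-1}\).

```r
# simulation check of cov(beta_hat) = sigma^2 (X^T X)^{-1} (illustrative values)
set.seed(5)
n <- 50; sigma <- 2
X <- cbind(1, runif(n, 0, 10))   # fixed design: intercept plus one covariate
beta <- c(1, 0.5)
theory <- sigma^2 * solve(t(X) %*% X)
ests <- replicate(2000, {
  y <- X %*% beta + rnorm(n, sd = sigma)
  drop(solve(t(X) %*% X, t(X) %*% y))   # least-squares estimate for this draw
})
empirical <- cov(t(ests))   # covariance of beta_hat across the 2000 simulations
round(empirical / theory, 2)   # entrywise ratios should be close to 1
```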
If we want to test the null hypothesis \[\begin{align} H_0 : \beta_j = \beta_j^{(0)} \end{align}\] we use the statistic \[\begin{align} t = \frac{\hat\beta_j- \beta_j^{(0)}}{\sqrt{\frac{RSS}{n-p-1}} \sqrt{\nu_{jj}}} \end{align}\] which under \(H_0\) has a \(t\)-distribution with \(n - p - 1\) degrees of freedom. If \(|t|\) is large, we reject the null hypothesis.
lm() computes the standard errors for each coefficient \(\hat{\beta_j}\) and also does a \(t\)-test for each coefficient, testing \[H_0: \beta_j = 0.\]
lm() also reports an F-test of the global null hypothesis that all slope coefficients are zero:
\[\begin{align}
H_0 : \beta_1 = \beta_2 = ... = \beta_p = 0
\end{align}\]
Under this null hypothesis, the statistic
\[\begin{align}
F_0 = \frac{RegSS / p}{RSS / (n - p - 1)}
\end{align}\]
follows an \(F\)-distribution with \(p\) and \(n - p - 1\) degrees of freedom. If \(F_0\) is large, we reject the null hypothesis.
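To see this formula in action before turning to the wage data, here is a sketch on simulated data (all names and values are illustrative) that computes \(F_0\) by hand and checks it against the value summary() reports:

```r
# compute the global F-statistic by hand and compare to summary(lm(...))
set.seed(1)
n <- 200; p <- 3
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 0.5 * x1 - 0.3 * x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)
RSS <- sum(resid(fit)^2)
RegSS <- sum((fitted(fit) - mean(y))^2)
F0 <- (RegSS / p) / (RSS / (n - p - 1))
all.equal(F0, unname(summary(fit)$fstatistic["value"]))   # TRUE
pf(F0, p, n - p - 1, lower.tail = FALSE)   # the p-value summary() reports
```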
Let’s use multiple linear regression to model log wages by education, experience, gender, and race.
model<- lm(lwage~educ+exper+female+nonwhite, data = wage1)
summary(model)
##
## Call:
## lm(formula = lwage ~ educ + exper + female + nonwhite, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89689 -0.26333 -0.03394 0.26654 1.28131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.483188 0.106141 4.552 6.61e-06 ***
## educ 0.091192 0.007156 12.743 < 2e-16 ***
## exper 0.009411 0.001451 6.487 2.04e-10 ***
## female -0.343712 0.037709 -9.115 < 2e-16 ***
## nonwhite -0.009889 0.061913 -0.160 0.873
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4293 on 521 degrees of freedom
## Multiple R-squared: 0.3526, Adjusted R-squared: 0.3476
## F-statistic: 70.93 on 4 and 521 DF, p-value: < 2.2e-16
What are the estimates of each coefficient?
What are their standard errors?
What are the t-statistics and p-values? What do these tell us?
Which variables are significant?
What is the residual standard error? How is it calculated?
How can we determine if this is a good model?
What do the asterisks/“significance codes” in the last column mean?
How would we compute the information without using the lm function?
# Compute the coefficient, standard error, t-statistic, and p-value for Education:
X<- as.matrix(cbind(1, wage1[,c("educ", "exper", "female", "nonwhite")]))
y<- wage1$lwage
beta_hat<- solve(t(X) %*% X, t(X) %*% y)
V <- solve(t(X) %*% X)
yhat<- X %*% beta_hat
RSS = sum((y - yhat)^2)
RSE = sqrt(RSS/(526-4-1)) # residual standard error: n - p - 1 = 521 degrees of freedom
beta_educ<- beta_hat[2]
beta_educ
## [1] 0.09119174
se_educ<- RSE * sqrt(V[2,2])
se_educ
## [1] 0.00715617
t_educ<- beta_educ / se_educ
t_educ
## [1] 12.74309
pval_educ<- 2*pt(-abs(t_educ), 526-4-1) # use the lower tail: 1 - pt() underflows to 0 here
pval_educ
## [1] 1.440082e-32
#compare to what we got from lm():
summary(model)$coefficients[2,]
## Estimate Std. Error t value Pr(>|t|)
## 9.119174e-02 7.156170e-03 1.274309e+01 1.440082e-32
P-values are probabilities. What probability is this? How do we interpret each p-value? (Maybe you want to draw a picture)
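One way to draw that picture in base R (the \(t\)-value below is made up for illustration; the degrees of freedom match the model above):

```r
# the two-sided p-value is the area in both tails of the t-density beyond |t|
tval <- 2.5; df <- 521          # illustrative t-value; df from the wage model
curve(dt(x, df), from = -4, to = 4, xlab = "t", ylab = "density")
abline(v = c(-tval, tval), lty = 2)   # observed |t| marks the tail cutoffs
pval <- 2 * pt(-abs(tval), df)        # total area in the two tails
pval
```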
What is the difference between the t-tests and F-test?
Why do we need t-tests if we have the F-test?
Why do we need the F-test if we can just do t-tests?
The F-test is very helpful for model comparisons when we don’t know which variables to include.
Recall that for a full model \(M\) with \(k+p\) predictors and a sub-model \(m\) with only \(k\) of them, we can test whether \(M\) is significantly better at predicting \(y\) with an F-test: under the null hypothesis that the \(p\) extra coefficients are all zero, \[ \frac{(RSS(m) - RSS(M))/p}{RSS(M)/(n-(k+p+1))} \sim F_{p,\; n-(k+p+1)}. \]
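Before applying this to the wage data, here is a sketch on simulated data (illustrative names and values) verifying that the statistic above matches what anova() computes for nested models:

```r
# nested-model F-test by hand, checked against anova()
set.seed(2)
n <- 150
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 2 + x1 + 0.5 * x2 + rnorm(n)
m <- lm(y ~ x1)                # sub-model: k = 1 predictor
M <- lm(y ~ x1 + x2 + x3)      # full model: p = 2 extra predictors
RSS_m <- sum(resid(m)^2)
RSS_M <- sum(resid(M)^2)
p_extra <- 2
Fstat <- ((RSS_m - RSS_M) / p_extra) / (RSS_M / df.residual(M))
all.equal(Fstat, anova(m, M)[2, "F"])   # TRUE
```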
Let’s use the F-test to compare some potential models:
model1<- lm(lwage~educ + exper, data = wage1)
model2<- lm(lwage~educ + exper + female, data = wage1)
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: lwage ~ educ + exper
## Model 2: lwage ~ educ + exper + female
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 523 111.345
## 2 522 96.036 1 15.309 83.211 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What does the output mean? Which model should we choose?
We can also use anova() to compare more than two models at a time.
model3<- lm(lwage~educ + exper + female + nonwhite, data = wage1)
anova(model1, model2, model3)
## Analysis of Variance Table
##
## Model 1: lwage ~ educ + exper
## Model 2: lwage ~ educ + exper + female
## Model 3: lwage ~ educ + exper + female + nonwhite
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 523 111.345
## 2 522 96.036 1 15.3089 83.0556 <2e-16 ***
## 3 521 96.031 1 0.0047 0.0255 0.8732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What does the output mean? Which model should we choose?
We can also compare more interesting models, beyond adding one variable at a time. Suppose we are interested in the effect of region on wages. Consider a model with educ, exper, female, and the three region indicators northcen, south, and west, versus a sub-model with only educ, exper, and female.
Here we are comparing 6 predictors to 3. Is the model with region information better? Conduct an F-test to find out.
# use an f-test to compare model 1 and model 2
region_model<- lm(lwage~educ+exper+female+northcen+south+west, data = wage1)
sub_model<- lm(lwage~educ+exper+female, data = wage1)
anova(sub_model, region_model)
## Analysis of Variance Table
##
## Model 1: lwage ~ educ + exper + female
## Model 2: lwage ~ educ + exper + female + northcen + south + west
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 522 96.036
## 2 519 94.386 3 1.6499 3.0241 0.02926 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Does adding region information help?
Now conduct t-tests for the individual coefficients in the model with region information.
# use t-tests to test each of the region coefficients for significance
summary(region_model)
##
## Call:
## lm(formula = lwage ~ educ + exper + female + northcen + south +
## west, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.99368 -0.27329 -0.03198 0.26476 1.31710
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.530145 0.111333 4.762 2.49e-06 ***
## educ 0.090343 0.007129 12.672 < 2e-16 ***
## exper 0.009550 0.001442 6.621 8.95e-11 ***
## female -0.349004 0.037542 -9.296 < 2e-16 ***
## northcen -0.075897 0.054089 -1.403 0.161
## south -0.081769 0.050412 -1.622 0.105
## west 0.064904 0.059950 1.083 0.279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4265 on 519 degrees of freedom
## Multiple R-squared: 0.3637, Adjusted R-squared: 0.3563
## F-statistic: 49.44 on 6 and 519 DF, p-value: < 2.2e-16
Are the region variables significant? Does this contradict the conclusion from the F-test?
It is often helpful to use polynomial terms and interaction terms in a regression model. Homework 4 will explore this more in depth.
When should we add an interaction?
Based on the variables in the dataset and your knowledge of labor economics, discuss possible interactions that we might find in the data and want to include in a model. How might we visualize these interactions?
# visualize potential interaction effects:
# no intercept difference, clear interaction
ggplot(wage1, aes(x = tenure, y = lwage)) + geom_point() +
facet_wrap(~nonwhite, scales = "free_x") +
geom_smooth(method = "lm", se = F)
ggplot(wage1, aes(x = exper, y = lwage)) + geom_point() +
facet_wrap(~female, scales = "free_x") +
geom_smooth(method = "lm", se = F)
ggplot(wage1, aes(x = exper, y = lwage)) + geom_point() +
facet_wrap(~trade, scales = "free_x") +
geom_smooth(method = "lm", se = F)
# clear intercept difference, no clear interaction effect
ggplot(wage1, aes(x = educ, y = lwage)) + geom_point() +
facet_wrap(~female, scales = "free_x") +
geom_smooth(method = "lm", se = F)
How can we determine whether an interaction is significant?
model<- lm(lwage ~ educ+exper+female+nonwhite+
exper*female + exper*nonwhite, data = wage1)
summary(model)
##
## Call:
## lm(formula = lwage ~ educ + exper + female + nonwhite + exper *
## female + exper * nonwhite, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.95494 -0.26072 -0.04278 0.25606 1.24445
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.365701 0.110111 3.321 0.000959 ***
## educ 0.092475 0.007151 12.932 < 2e-16 ***
## exper 0.015200 0.002065 7.361 7.2e-13 ***
## female -0.170390 0.060154 -2.833 0.004797 **
## nonwhite 0.124689 0.100513 1.241 0.215342
## exper:female -0.010182 0.002763 -3.686 0.000252 ***
## exper:nonwhite -0.007675 0.004587 -1.673 0.094863 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.424 on 519 degrees of freedom
## Multiple R-squared: 0.371, Adjusted R-squared: 0.3637
## F-statistic: 51.02 on 6 and 519 DF, p-value: < 2.2e-16
How do we interpret these coefficients?
Holding education constant, for each additional year of experience, how much would we expect wages to increase for a white male? How much for a white female? For a nonwhite male? For a nonwhite female?
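The arithmetic behind those answers is just adding the relevant coefficients. A quick sketch on simulated data (illustrative values, not the wage model) makes the point:

```r
# with an x:g interaction, the slope of x depends on the group g
set.seed(3)
n <- 300
x <- runif(n, 0, 10)
g <- rbinom(n, 1, 0.5)           # binary group, like female or nonwhite
y <- 1 + 0.5 * x - 0.2 * g - 0.3 * x * g + rnorm(n)
fit <- lm(y ~ x * g)
b <- coef(fit)
slope_g0 <- unname(b["x"])             # slope of x when g = 0
slope_g1 <- unname(b["x"] + b["x:g"])  # slope of x when g = 1
c(slope_g0, slope_g1)   # should be near the true slopes 0.5 and 0.2
```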
Why might we want to include a quadratic term? Consider the tenure variable in this dataset. Does it appear linear?
ggplot(wage1, aes(x= tenure, y = log(wage))) + geom_point()
# How should we model this variable? line or curve?
ggplot(wage1, aes(x= tenure, y = log(wage))) + geom_point()+
geom_smooth(method = "loess", se = F, span = 1) +
geom_smooth(method = "lm", se = F, color = "red")
What is the intuition behind the quadratic trend?
What other variables might we consider modeling with a quadratic?
Let’s include the square of tenure in our model:
model<- lm(lwage~educ+tenure+tenursq, data = wage1)
summary(model)
##
## Call:
## lm(formula = lwage ~ educ + tenure + tenursq, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07720 -0.28197 -0.02346 0.26859 1.41509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3688491 0.0908138 4.062 5.62e-05 ***
## educ 0.0852822 0.0068978 12.364 < 2e-16 ***
## tenure 0.0510784 0.0067937 7.518 2.43e-13 ***
## tenursq -0.0009941 0.0002463 -4.036 6.24e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4365 on 522 degrees of freedom
## Multiple R-squared: 0.3294, Adjusted R-squared: 0.3256
## F-statistic: 85.49 on 3 and 522 DF, p-value: < 2.2e-16
How can we interpret these coefficients?
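One useful consequence of the quadratic form: with coefficient \(b_1\) on tenure and \(b_2\) on tenure squared, the fitted curve turns over at tenure \(= -b_1/(2b_2)\), here roughly \(0.051/(2 \times 0.00099) \approx 26\) years. A sketch on simulated data (illustrative values) shows how to recover a turning point from a fitted quadratic:

```r
# turning point of a fitted quadratic: x* = -b1 / (2 * b2)
set.seed(4)
n <- 1000
x <- runif(n, 0, 40)
y <- 1 + 0.05 * x - 0.001 * x^2 + rnorm(n, sd = 0.3)  # true peak at x = 25
fit <- lm(y ~ x + I(x^2))
b <- coef(fit)
turn <- unname(-b["x"] / (2 * b["I(x^2)"]))
turn   # should be near 25
```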
Interactions are important. What happens when we fail to account for an interaction effect?
First, let’s set up a simulation and check that coefficients are unbiased in the multiple regression setting when we have interactions.
# True parameters (pretend these are unknown)
beta0 = 32
beta1 = 1.5
beta2 = 0.1
beta3 = 3
# draw data according to the linear regression model.
n_obs <- 100 # number of observations
sig <- 6 # error standard deviation of the y's
# draw x's: they will be fixed for our experiment
x1 <- runif(n_obs, 0, 10)
x2 <- round(runif(n_obs, 0, 1))
For example, we might think of \(x_1\) as a continuous variable like years of experience and \(x_2\) as a binary indicator like gender.
draw_data <- function(x1, x2, beta0, beta1, beta2, beta3, sig){
n_obs <- length(x1)
# draw y's
y = rnorm(n_obs, mean = beta0 + beta1*x1 + beta2*x2 + beta3*x1*x2, sd = sig)
return(data.frame(x1 = x1, x2 = x2, y = y))
}
dataset <- draw_data(x1,x2, beta0, beta1, beta2, beta3, sig)
head(dataset)
## x1 x2 y
## 1 2.544427 1 47.45671
## 2 6.412047 1 71.08616
## 3 9.056090 0 44.11083
## 4 6.105229 0 42.27927
## 5 1.391444 1 39.56522
## 6 8.889208 0 43.43209
Let’s compute the regression coefficients using our simulated dataset:
# Compute the regression coefficients from simulated data:
model<- lm(y ~ x1 + x2 + x1*x2, data = dataset)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2 + x1 * x2, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4921 -3.0324 0.3488 3.3919 16.2117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9819 1.6712 18.539 < 2e-16 ***
## x1 1.6544 0.2775 5.962 4.13e-08 ***
## x2 2.4557 2.3401 1.049 0.297
## x1:x2 2.9182 0.3951 7.385 5.54e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.518 on 96 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8446
## F-statistic: 180.4 on 3 and 96 DF, p-value: < 2.2e-16
# notice that the estimates are not spot on, but are within a standard error or two
# also, x2 is not significant here
Are these estimates unbiased? We can run a simulation to check:
# draw multiple datasets from the model
# and for each dataset, recompute the regression coefficients.
n_trials <- 200
intercepts <- rep(0, n_trials)
X1_coefs <- X2_coefs <- X1X2_coefs<- rep(0, n_trials)
for(i in 1:n_trials){
dataset <- draw_data(x1, x2, beta0, beta1, beta2, beta3, sig)
model <- lm(y ~ x1 + x2 + x1*x2, data = dataset)
intercepts[i] <- model$coefficients[1]
X1_coefs[i] <- model$coefficients[2]
X2_coefs[i] <- model$coefficients[3]
X1X2_coefs[i]<- model$coefficients[4]
}
What are the distributions of our estimated coefficients?
# examine distribution of the coefficients
hist(intercepts)
abline(v = beta0, col = "red")
hist(X1_coefs)
abline(v=beta1, col = "red")
hist(X2_coefs)
abline(v=beta2, col = "red")
hist(X1X2_coefs)
abline(v=beta3, col = "red")
Do our estimates appear unbiased?
What happens when we omit the interaction? Let’s see what goes wrong when we misspecify the model.
lm(y ~ x1 + x2, data = dataset)
##
## Call:
## lm(formula = y ~ x1 + x2, data = dataset)
##
## Coefficients:
## (Intercept) x1 x2
## 22.697 3.219 15.996
# yikes
How bad is this? Run the simulation again, but this time fit a model only on x1 and x2, ignoring the interaction. Examine the distribution of the coefficients.
# run the simulation again, still drawing data from the model y~x1+x2+x1*x2,
# but this time fit a model only on x1+x2
intercepts <- X1_coefs<- X2_coefs<- rep(0, n_trials)
for(i in 1:n_trials){
dataset <- draw_data(x1, x2, beta0, beta1, beta2, beta3, sig)
wrong_model <- lm(y ~ x1+x2, data = dataset)
intercepts[i] <- wrong_model$coefficients[1]
X1_coefs[i] <- wrong_model$coefficients[2]
X2_coefs[i] <- wrong_model$coefficients[3]
}
# see what happens to the coefficients
beta0
## [1] 32
mean(intercepts)
## [1] 24.20311
beta1
## [1] 1.5
mean(X1_coefs)
## [1] 2.987696
beta2
## [1] 0.1
mean(X2_coefs)
## [1] 15.67322
hist(intercepts, breaks = 15, xlim = c(min(intercepts), 32))
abline(v = beta0, col = "red")
hist(X1_coefs, breaks = 15, xlim = c(1, max(X1_coefs)))
abline(v = beta1, col = "red")
hist(X2_coefs, breaks = 15, xlim = c(0, max(X2_coefs)))
abline(v = beta2, col = "red")
Are the coefficients still unbiased if we get the model wrong (if we don’t include the \(x_1 x_2\) interaction in the model)?
If we forget the interaction, how do our conclusions change? How might these be erroneous?
Discuss any lingering problems.
Wooldridge, J. M. (2000), Introductory Econometrics: A Modern Approach, South-Western College Publishing.↩